
feat(academy): add advanced crawling section with sitemaps and search #1217

Open · wants to merge 5 commits into master
Conversation

metalwarrior665 (Member)

No description provided.

@metalwarrior665 (Member Author) commented Sep 17, 2024

Will process the lint issues soon

@fnesveda added the t-academy label (Issues related to Web Scraping and Apify academies) on Sep 18, 2024
@metalwarrior665 (Member Author)

@TC-MO If we change the URL of an article, do I need to contact the web team to set up a hard redirect?

@TC-MO (Contributor) commented Sep 18, 2024

I think we do redirects in the nginx.conf file; not sure if there is any other way.

@metalwarrior665 (Member Author)

TODO redirect
..._web_scraping/scraping_paginated_sites.md → ...scraping/crawling/crawling-with-search.md

@@ -9,6 +9,16 @@ import Example from '!!raw-loader!roa-loader!./scraping_from_sitemaps.js';

# How to scrape from sitemaps {#scraping-with-sitemaps}

>Crawlee recently introduced a new feature that allows you to scrape sitemaps with ease. If you are using Crawlee, you can skip the following steps and just gather all the URLs from the sitemap in a few lines of code:
Contributor:
This feels like unnecessary dating? When is "recently"? Also, I think this could work better as an admonition instead of a blockquote.

---
title: Crawling sitemaps
description: Learn how to extract all of a website's listings even if they limit the number of results pages. See code examples for setting up your scraper.
menuWeight: 2
Contributor:
Is this something custom-created by Apify? I haven't seen it anywhere else.

title: Crawling sitemaps
description: Learn how to extract all of a website's listings even if they limit the number of results pages. See code examples for setting up your scraper.
menuWeight: 2
paths:
Contributor:
Isn't this supposed to be `slug:`?


Apify provides the [Sitemap Sniffer actor](https://apify.com/vaclavrut/sitemap-sniffer) (open-source code), which scans the URL variations automatically so that you don't have to check them manually.

## [](#how-to-set-up-http-requests-to-download-sitemaps) How to set up HTTP requests to download sitemaps
Contributor:
If anchors do not differ from headings, then these are unnecessary, from what I remember.


Fortunately, you don't have to worry about any of the above steps if you use [Crawlee](https://crawlee.dev), which has rich traversing and parsing support for sitemaps. Crawlee can traverse nested sitemaps, download and parse compressed sitemaps, and extract URLs from them. You can get all the URLs in a few lines of code:

```javascript
Contributor:
Could we switch it to ```js? Some time back we changed this for consistency across Academy & Platform docs. I'll add this info to the contributing guidelines.
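For reference, the URL-extraction step for a sitemap can be sketched in plain Node (not the code from this diff; the sample sitemap is hypothetical, and a regex is used for brevity where a real XML parser or Crawlee's own sitemap support would be safer):

```js
// Sketch: pull <loc> URLs out of a sitemap XML string.
function extractSitemapUrls(xml) {
  return [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)].map((m) => m[1]);
}

// Hypothetical sample sitemap.
const sampleSitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/product/1</loc></url>
  <url><loc>https://example.com/product/2</loc></url>
</urlset>`;

const urls = extractSitemapUrls(sampleSitemap);
console.log(urls);
// → [ 'https://example.com/product/1', 'https://example.com/product/2' ]
```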

- advanced-web-scraping/crawling/sitemaps-vs-search
---

The core crawling problem comes down to ensuring that we reliably find all detail pages on the target website or inside its categories. This is trivial for small sites: we just open the home page or category pages and paginate to the end, as we did in the Web Scraping for Beginners course.
Contributor:
I'm not entirely sure if "home page" is correct; perhaps @TheoVasilis could weigh in? Home page or homepage?


The core crawling problem comes down to ensuring that we reliably find all detail pages on the target website or inside its categories. This is trivial for small sites: we just open the home page or category pages and paginate to the end, as we did in the Web Scraping for Beginners course.

Unfortunately, **most modern websites restrict pagination** only to somewhere between 1 and 10 thousand products. Solving this problem might seem relatively straightforward at first but there are multiple hurdles that we will explore in this lesson.
Contributor:
Suggested change
Unfortunately, **most modern websites restrict pagination** only to somewhere between 1 and 10 thousand products. Solving this problem might seem relatively straightforward at first but there are multiple hurdles that we will explore in this lesson.
Unfortunately, **most modern websites restrict pagination** only to somewhere between 1 and 10 000 products. Solving this problem might seem relatively straightforward at first but there are multiple hurdles that we will explore in this lesson.
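A common way around such pagination caps (a sketch, not part of this PR) is to split a listing into filtered slices, for example by price range, until each slice fits under the limit. `countResults` here is a hypothetical stand-in for a real request that returns the number of listings in a range:

```js
// Sketch: recursively split [min, max] price ranges until each slice
// returns fewer results than the site's pagination cap.
function splitRanges(min, max, countResults, cap = 10000) {
  const count = countResults(min, max);
  if (count <= cap || max - min <= 1) return [{ min, max, count }];
  const mid = Math.floor((min + max) / 2);
  return [
    ...splitRanges(min, mid, countResults, cap),
    ...splitRanges(mid, max, countResults, cap),
  ];
}

// Mock site: 25,000 items spread uniformly over prices 0–1000.
const mockCount = (min, max) => Math.round(((max - min) / 1000) * 25000);
const slices = splitRanges(0, 1000, mockCount, 10000);
console.log(slices.map((s) => `${s.min}-${s.max}: ${s.count}`));
// → [ '0-250: 6250', '250-500: 6250', '500-750: 6250', '750-1000: 6250' ]
```

Each resulting slice can then be paginated to the end independently, so no listings are lost to the cap.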

category: web scraping & automation
slug: /advanced-web-scraping
paths:
- advanced-web-scraping
---

# Advanced web scraping
Contributor:
If the title in frontmatter does not differ from the h1, the h1 is unnecessary; it will be automatically generated by Docusaurus.

---
In this course, we will take all of that knowledge, add a few more advanced concepts, and apply them to learn how to build a production-ready web scraper.

## [](#what-does-production-ready-mean) What does production-ready mean?
Contributor:
If I remember correctly, headers should not use punctuation

@honzajavorek (Collaborator)

Will review, but I think I will wait for @TC-MO's comments to be addressed first.

Labels
t-academy Issues related to Web Scraping and Apify academies.
4 participants